This report is a portion of the AMIA 2015 Tutorial on Using R for Healthcare Data Science. All code and data available at my GitHub page.

Introduction

This report will walk you through the data scientist’s workflow and how recent R packages make data science easier and more intuitive. First, let’s start with a couple of disclaimers:

  1. This tutorial is to give you a sense of what is possible with R and motivate you to learn more - not to teach you every detail of code or packages available.
  2. We will not spend extensive time on data modeling. This tutorial is intended to work through data janitor tasks and reporting - in my experience some of the most time consuming tasks of data science.

To illustrate how packages released over the past few years have made these tasks easier we will walk through an entire analysis plan using data published by the International Warfarin Pharmacogenomics Consortium available on the PharmGKB website.

The Data Scientist’s Workflow

Taking a cue from David Rob1, Data Scientist at Stack Overflow, and Philip Guo2, Assistant Professor of Computer Science at University of Rochester, here is my view of the primary computational data science workflow:

The preparation phase of the workflow involves:

  1. Getting data into R
  2. Data Tidying
    1. following principles of tidy data3
    2. ensuring correct data types
  3. Data Manipulation to prepare for analysis
    1. adjusting date/times
    2. parsing strings
    3. creating/combining variables

The analysis phase consists of:

  1. Data Modeling (e.g., statistics, machine learning etc.)
  2. Model Tidying and Manipulation
    1. turn R model objects into clean tables
    2. compare different models
  3. Data Visualization
    1. graphs and tables of data
    2. graphs and table of model results

Finally, the dissemination phase to share the results of their work:

  1. Writing Reports (e.g., technical reports that show analysis steps - great for sharing with analysts, and reproducible research)
  2. Publishing (either as formatted journal articles or reports for non-technical readers)
  3. Web Applications (interactive visualization tools)

Over the past few years the growth in tools aiding these steps has been phenomenal. We will cover each of these as we move through the workflow steps, but here is a summary of the different packages I’ve found useful for these steps:

Preparation Phase

Getting Data into R

Let’s load up our IWPC data! We will be using a slightly modified form of the main data set, that I have manually turned into a tab delimited text file. Although there are a number of libraries to read in excel files, the non-standard column names in the data set make it easier to work with a tsv. We are going to use read.delim() as opposed to readr’s read_tsv() for two reasons:

  1. The non-standard column names (contains spaces, returns, and symbols)
  2. Changing data types. This data is from a consortium and contains different types of data in each column based on the study site.

This last reason is the deal breaker for readr. Readr interpolates the variable type (column, date, number, etc.) based on the first 100 rows or via manual specification. Given the large number of columns (78) this becomes annoying at best. However, since we can’t take advantage of readr automatically making a tbl_df() object, so we will have to do so manually.

iwpc_data <- read.delim(file = "iwpc_data_7_3_09_revised3.txt") %>% tbl_df()

Let’s take a look at the type of data we are working with.

Tidying your Data

Looking at our data above we see there are a number of problems:

  1. Long, non-standard column names
  2. Lots of columns
  3. Lots of different types of entries for each column

Manipulating Data

Working with Dates

Working with Text

Analysis Phase

Modeling

Model Tiying and Manipulation

Data and Model Visualization

Dissemination Phase

Writing Reports and Publishing

Deploying Online Tools


  1. http://varianceexplained.org/r/broom-slides/

  2. http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext

  3. http://www.jstatsoft.org/v59/i10/paper